Coupled Replicator Equations for the 
Dynamics of Learning in Multiagent Systems

Starting with a group of reinforcement-learning agents, we derive coupled replicator equations that describe the dynamics of collective learning in multiagent systems. We show that, although agents model their environment in a self-interested way without sharing knowledge, a game dynamics emerges naturally through environment-mediated interactions. An application to rock-scissors-paper game interactions shows that the collective learning dynamics exhibits a diversity of competitive and cooperative behaviors. These include quasiperiodicity, stable limit cycles, intermittency, and deterministic chaos: behaviors that should be expected in heterogeneous multiagent systems described by the general replicator equations we derive.

Adaptive behavior in multiagent systems is an important interdisciplinary topic that appears in various guises in many fields, including biology [1], computer science [2], economics [3], and cognitive science [4]. One of the key common questions is how and whether a group of intelligent agents truly engages in collective behaviors that are more functional than individuals acting alone.

Suppose that many agents interact with an environment and each independently builds a model from its sensory stimuli. In this simple type of coupled multiagent system, collective learning (if it occurs) is a dynamical behavior driven by agents' environment-mediated interaction [5, 6]. Here we show that the collective dynamics in multiagent systems, in which agents use reinforcement learning [7], can be modeled using a generalized form of coupled replicator equations.

While replicator dynamics were introduced originally for evolutionary game theory [8], the relationship between reinforcement learning and replicator equations has been developed only recently [9]. Here, we extend these considerations to multiagent systems, introducing the theory behind a previously reported game-theoretic model [13]. We show that replicator dynamics emerges as a special case of the continuous-time limit for multiagent reinforcement learning systems. The overall approach, though, establishes a general framework for dynamical-systems analyses of adaptive behavior in collectives.

Notably, in learning with perfect memory, our model reduces to the form of a multipopulation replicator equation introduced in Ref. [10]. For two agents with perfect memory interacting via a zero-sum rock-scissors-paper game the dynamics exhibits Hamiltonian chaos [13]. In contrast, as we show here, with memory decay multiagent systems generally become dissipative and display the full range of nonlinear dynamical behaviors, including limit cycles, intermittency, and deterministic chaos.

Santa Fe Institute Working Paper 02-04-017

Our multiagent model begins with simple reinforcement learning agents. To simplify the development, we assume that there are two such agents $X$ and $Y$ that at each time step take one of $N$ actions: $i = 1, \ldots, N$. Let the probability for $X$ to choose action $i$ be $x_i(n)$ and $y_i(n)$ for $Y$, where $n$ is the number of learning iterations from the initial state at $n = 0$. The agents' choice distributions at time $n$ are $\mathbf{x}(n) = (x_1(n), \ldots, x_N(n))$ and $\mathbf{y}(n) = (y_1(n), \ldots, y_N(n))$, with $\sum_i x_i(n) = \sum_i y_i(n) = 1$.

Let $R^X_{ij}$ and $R^Y_{ij}$ denote the rewards for $X$ taking action $i$ and $Y$ action $j$ at step $n$, respectively. Given these actions, $X$'s and $Y$'s memories, $Q_i^X(n)$ and $Q_j^Y(n)$, of the past benefits from their actions are governed by

$$ Q_i^X(n+1) - Q_i^X(n) = R^X_{ij} - \alpha_X Q_i^X(n) \quad \text{and} \quad Q_j^Y(n+1) - Q_j^Y(n) = R^Y_{ij} - \alpha_Y Q_j^Y(n) , \tag{1} $$

where $\alpha_X, \alpha_Y \in [0, 1)$ control each agent's memory decay rate and $Q_i^X(0) = Q_j^Y(0) = 0$. The agents choose their next actions according to the $Q$'s, updating their choice distributions as follows:



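$$ x_i(n) = \frac{e^{\beta_X Q_i^X(n)}}{\sum_j e^{\beta_X Q_j^X(n)}} \quad \text{and} \quad y_i(n) = \frac{e^{\beta_Y Q_i^Y(n)}}{\sum_j e^{\beta_Y Q_j^Y(n)}} , \tag{2} $$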



where $\beta_X, \beta_Y \in [0, \infty]$ control the learning sensitivity: how much the current choice distributions are affected by past rewards. Using Eq. (2), the dynamic governing the change in agent state is given by:

$$ x_i(n+1) = \frac{x_i(n)\, e^{\beta_X (Q_i^X(n+1) - Q_i^X(n))}}{\sum_j x_j(n)\, e^{\beta_X (Q_j^X(n+1) - Q_j^X(n))}} , \tag{3} $$

and similarly for $y_i(n+1)$.
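
The discrete-time rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the values of `beta` and `alpha` are hypothetical, and `memory_step` assumes that every memory decays while only the chosen action receives the reward, a detail Eq. (1) leaves implicit. The sketch checks that the incremental rule of Eq. (3) reproduces the distribution given directly by Eq. (2).

```python
import math

def choice_distribution(q, beta):
    """Eq. (2): x_i = exp(beta * Q_i) / sum_j exp(beta * Q_j)."""
    m = max(q)  # shifting by the maximum leaves the ratio unchanged
    w = [math.exp(beta * (qi - m)) for qi in q]
    total = sum(w)
    return [wi / total for wi in w]

def incremental_update(x, q_old, q_new, beta):
    """Eq. (3): update x using only the change in the memories."""
    w = [xi * math.exp(beta * (qn - qo)) for xi, qo, qn in zip(x, q_old, q_new)]
    total = sum(w)
    return [wi / total for wi in w]

def memory_step(q, action, reward, alpha):
    """Eq. (1), assuming all memories decay and only the chosen action is rewarded."""
    return [qi + (reward if i == action else 0.0) - alpha * qi
            for i, qi in enumerate(q)]

beta, alpha = 2.0, 0.1        # illustrative values, not taken from the paper
q0 = [0.0, 0.0, 0.0]          # Q_i(0) = 0, so x(0) is uniform
x0 = choice_distribution(q0, beta)
q1 = memory_step(q0, action=1, reward=1.0, alpha=alpha)

x1_direct = choice_distribution(q1, beta)                # Eq. (2) applied to Q(1)
x1_incremental = incremental_update(x0, q0, q1, beta)    # Eq. (3) applied to x(0)
```

Both routes give the same distribution, which is why the learning dynamics can be written purely in terms of changes in the memories.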

Consider the continuous-time limit corresponding to agents performing a large number of actions (iterates of Eqs. (1)) for each choice-distribution update (iterates of Eq. (3)). In this case, we have two different time scales: that for agent-agent interactions and that for learning. We assume that the learning dynamics is very slow compared to interactions and so $\mathbf{x}$ and $\mathbf{y}$ are essentially constant during the latter. Then, based on Eq. (3), continuous-time learning for agent $X$ is governed by

$$ \dot{x}_i = \beta_X x_i \Big( \dot{Q}_i^X - \sum_j \dot{Q}_j^X x_j \Big) , \tag{4} $$

and for the dynamic governing memory updates we have

$$ \dot{Q}_i^X = R_i^X - \alpha_X Q_i^X , \tag{5} $$

where $R_i^X$ is the reward for $X$ choosing action $i$, averaged over $Y$'s actions during the time interval between learning updates. Putting together Eqs. (2), (4), and (5) one finds



$$ \frac{\dot{x}_i}{x_i} = \beta_X \Big[ R_i^X - \sum_j x_j R_j^X \Big] + \alpha_X I_i^X , \tag{6} $$



where $I_i^X \equiv \sum_j x_j \log(x_j / x_i)$ represents the effect of memory with decay parameter $\alpha_X$. (The continuous-time dynamic of $Y$ follows in a similar manner.) Eq. (6), extended to account for any number of agents and actions, constitutes our general model for reinforcement-learning multiagent systems.
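
For readers who want the intermediate algebra, the step from the learning and memory equations to the general model can be written out as follows. This is a sketch that uses only the definitions above: the softmax form of the choice distribution, the continuous-time learning equation, and the memory equation with decay rate $\alpha_X$.

```latex
\begin{align*}
\frac{\dot{x}_i}{x_i}
  &= \beta_X \Big( \dot{Q}_i^X - \sum_j x_j \dot{Q}_j^X \Big) \\
  &= \beta_X \Big( R_i^X - \sum_j x_j R_j^X \Big)
     + \alpha_X \beta_X \sum_j x_j \big( Q_j^X - Q_i^X \big)
     && \text{using } \dot{Q}_i^X = R_i^X - \alpha_X Q_i^X,\ \textstyle\sum_j x_j = 1 \\
  &= \beta_X \Big( R_i^X - \sum_j x_j R_j^X \Big)
     + \alpha_X \sum_j x_j \log\frac{x_j}{x_i}
     && \text{using } \frac{x_j}{x_i} = e^{\beta_X (Q_j^X - Q_i^X)}
\end{align*}
```

The last line shows why the memory term depends only on the choice distribution itself: the softmax form lets differences of memories be replaced by logarithms of probability ratios.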

Simplifying again, assume a fixed relationship between pairs $(i, j)$ of $X$'s and $Y$'s actions and between rewards for both agents: $R^X_{ij} = a_{ij}$ and $R^Y_{ij} = b_{ij}$. Assume further that $\mathbf{x}$ and $\mathbf{y}$ are independently distributed; then the time-average rewards for $X$ and $Y$ become

$$ R_i^X = \sum_j a_{ij} y_j \quad \text{and} \quad R_i^Y = \sum_j b_{ij} x_j . \tag{7} $$

In this restricted case, the continuous-time dynamic is:



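$$ \frac{\dot{x}_i}{x_i} = \beta_X \big[ (A\mathbf{y})_i - \mathbf{x} \cdot A\mathbf{y} \big] + \alpha_X I_i^X , \qquad \frac{\dot{y}_i}{y_i} = \beta_Y \big[ (B\mathbf{x})_i - \mathbf{y} \cdot B\mathbf{x} \big] + \alpha_Y I_i^Y , \tag{8} $$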



where $(A)_{ij} = a_{ij}$ and $(B)_{ij} = b_{ij}$, $(A\mathbf{x})_i$ is the $i$th element of the vector $A\mathbf{x}$, and $\beta_X$ and $\beta_Y$ control the time-scale of each agent's learning.
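
A minimal Python sketch of the right-hand side of this linear-reward dynamic for one agent is given below. The function names, the payoff matrix, and the parameter values are illustrative assumptions. The sketch also checks a property that follows directly from the equations: the components of the rate of change sum to zero, so the state stays on the simplex.

```python
import math

def memory_term(x):
    """I_i = sum_j x_j * log(x_j / x_i), the memory-decay term."""
    return [sum(xj * math.log(xj / xi) for xj in x) for xi in x]

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def learning_rate_of_change(x, y, payoff, beta, alpha):
    """dx_i/dt for an agent with state x facing an opponent with state y."""
    fitness = matvec(payoff, y)                          # (A y)_i
    mean = sum(xi * fi for xi, fi in zip(x, fitness))    # x . A y
    decay = memory_term(x)
    return [xi * (beta * (fi - mean) + alpha * di)
            for xi, fi, di in zip(x, fitness, decay)]

# Arbitrary 3-action example; none of these numbers come from the paper.
A = [[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]]
x = [0.2, 0.3, 0.5]
y = [0.6, 0.3, 0.1]
dx = learning_rate_of_change(x, y, A, beta=1.0, alpha=0.05)
total_change = sum(dx)   # vanishes up to rounding
```

The memory term also vanishes at the uniform distribution, since every ratio $x_j / x_i$ equals one there.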

We can regard $A$ and $B$ as $X$'s and $Y$'s game-theoretic payoff matrices for action $i$ against opponent's action $j$ [18]. In contrast with game theory, which assumes agents have exact knowledge of the game structure and of other agents' strategies, reinforcement-learning agents have no knowledge of a "game" in which they are playing, only a myopic model of the environment (the other agent or agents) given implicitly via the rewards they receive. Nonetheless, a game dynamics emerges, via $R^X$ and $R^Y$ in Eq. (6), as a description of the collective's global behavior.

Given the basic equations of motion for the reinforcement-learning multiagent system (Eq. (6)), one becomes interested in, on the one hand, the time evolution of each agent's state vector in the simplices $\mathbf{x} \in \Delta_X$ and $\mathbf{y} \in \Delta_Y$ and, on the other, the dynamics in the higher-dimensional collective simplex $(\mathbf{x}, \mathbf{y}) \in \Delta_X \times \Delta_Y$. Following Ref. [11], we transform from $(\mathbf{x}, \mathbf{y}) \in \Delta_X \times \Delta_Y$ to $U = (\mathbf{u}, \mathbf{v}) \in \mathbb{R}^{2(N-1)}$ with $\mathbf{u} = (u_1, \ldots, u_{N-1})$ and $\mathbf{v} = (v_1, \ldots, v_{N-1})$, where $u_i = \log(x_{i+1}/x_1)$ and $v_i = \log(y_{i+1}/y_1)$ $(i = 1, \ldots, N-1)$. The result is a new version of our simplified model (Eqs. (8)), useful both for numerical stability during simulation and also for analysis in certain limits:



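$$ \dot{u}_i = \beta_X \frac{\sum_j \tilde{a}_{ij} e^{v_j} + \tilde{a}_{i0}}{1 + \sum_j e^{v_j}} - \alpha_X u_i \quad \text{and} \quad \dot{v}_i = \beta_Y \frac{\sum_j \tilde{b}_{ij} e^{u_j} + \tilde{b}_{i0}}{1 + \sum_j e^{u_j}} - \alpha_Y v_i , \tag{9} $$

where $\tilde{a}_{ij} = a_{i+1,j+1} - a_{1,j+1}$ and $\tilde{b}_{ij} = b_{i+1,j+1} - b_{1,j+1}$. Since the dissipation rate $\gamma$ in $U$ is

$$ \gamma = \sum_i \frac{\partial \dot{u}_i}{\partial u_i} + \sum_j \frac{\partial \dot{v}_j}{\partial v_j} = -(N-1)(\alpha_X + \alpha_Y) , \tag{10} $$

Eqs. (9) are conservative when $\alpha_X = \alpha_Y = 0$ and the time average of a trajectory is the Nash equilibrium of the game specified by $A$ and $B$, if a limit set exists in the interior of $\Delta_X \times \Delta_Y$ [19]. Moreover, if the game is zero-sum, the dynamics are Hamiltonian in $U$ with

$$ H = -\Big( \sum_j x_j^* u_j + \sum_j y_j^* v_j \Big) + \log\Big(1 + \sum_j e^{u_j}\Big) + \log\Big(1 + \sum_j e^{v_j}\Big) , \tag{11} $$

where $(\mathbf{x}^*, \mathbf{y}^*)$ is an interior Nash equilibrium [11].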
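
The change of coordinates can be checked numerically. The sketch below, with hypothetical function names and arbitrary test values, evaluates the log-ratio dynamic for agent $X$ and compares it with the same quantity obtained from the simplex form, $d\log(x_{i+1}/x_1)/dt = \dot{x}_{i+1}/x_{i+1} - \dot{x}_1/x_1$. Python lists are 0-based, so column 0 of the payoff matrix supplies the constant term $\tilde{a}_{i0}$.

```python
import math

def to_log_ratios(x):
    """u_i = log(x_{i+1} / x_1) for i = 1, ..., N-1."""
    return [math.log(xi / x[0]) for xi in x[1:]]

def to_simplex(u):
    """Inverse map: recover (x_1, ..., x_N) from the log ratios."""
    z = 1.0 + sum(math.exp(ui) for ui in u)
    return [1.0 / z] + [math.exp(ui) / z for ui in u]

def growth_rates(x, y, a, beta, alpha):
    """(dx_i/dt) / x_i in simplex coordinates, linear-reward case."""
    fitness = [sum(aij * yj for aij, yj in zip(row, y)) for row in a]
    mean = sum(xi * fi for xi, fi in zip(x, fitness))
    return [beta * (fi - mean) + alpha * sum(xj * math.log(xj / xi) for xj in x)
            for xi, fi in zip(x, fitness)]

def u_dot(u, v, a, beta, alpha):
    """du_i/dt in log-ratio coordinates; list row i is row i+1 of the text."""
    n = len(a)
    z = 1.0 + sum(math.exp(vj) for vj in v)
    out = []
    for i in range(1, n):
        constant = a[i][0] - a[0][0]   # tilde a_{i0}
        weighted = sum((a[i][j] - a[0][j]) * math.exp(v[j - 1]) for j in range(1, n))
        out.append(beta * (weighted + constant) / z - alpha * u[i - 1])
    return out

# Arbitrary payoff matrix, states, and parameters for the comparison.
a = [[0.3, 1.0, -1.0], [-1.0, 0.3, 1.0], [1.0, -1.0, 0.3]]
x, y = [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]
beta, alpha = 1.5, 0.02

u, v = to_log_ratios(x), to_log_ratios(y)
rates = growth_rates(x, y, a, beta, alpha)
from_simplex_form = [r - rates[0] for r in rates[1:]]
from_log_ratio_form = u_dot(u, v, a, beta, alpha)
```

The two lists agree to rounding error. The memory term collapses to $-\alpha_X u_i$ because $I_{i+1}^X - I_1^X = \log(x_1/x_{i+1})$, which is also why each $\dot{u}_i$ depends on $\mathbf{u}$ only through that linear term.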

To illustrate the dynamical-systems analysis of learning in multiagent systems using the above framework, we now analyze the behavior of the two-person rock-scissors-paper interaction [20]. This familiar game describes a nontransitive three-sided competition: rock beats scissors, scissors beats paper, and paper beats rock. The reward structure (environment) is given by:



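$$ A = \begin{pmatrix} \epsilon_X & 1 & -1 \\ -1 & \epsilon_X & 1 \\ 1 & -1 & \epsilon_X \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} \epsilon_Y & 1 & -1 \\ -1 & \epsilon_Y & 1 \\ 1 & -1 & \epsilon_Y \end{pmatrix} , \tag{12} $$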



where $\epsilon_X, \epsilon_Y \in [-1.0, 1.0]$ are the rewards for ties. The mixed Nash equilibrium is $x_i^* = y_i^* = 1/3$ $(i = 1, 2, 3)$: the centers of $\Delta_X$ and $\Delta_Y$. If $\epsilon_X = -\epsilon_Y$, the game is zero-sum.
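
Both statements about the rock-scissors-paper payoffs can be verified directly. In the sketch below the helper names are hypothetical, and the zero-sum test uses the convention $a_{ij} + b_{ji} = 0$, since $b_{ji}$ is $Y$'s payoff for action $j$ against $X$'s action $i$.

```python
def rsp_payoff(eps):
    """Rock-scissors-paper payoff matrix with tie reward eps."""
    return [[eps, 1.0, -1.0], [-1.0, eps, 1.0], [1.0, -1.0, eps]]

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def is_zero_sum(a, b, tol=1e-12):
    """True if a_ij + b_ji = 0 for every pair of actions."""
    n = len(a)
    return all(abs(a[i][j] + b[j][i]) <= tol for i in range(n) for j in range(n))

A, B = rsp_payoff(0.5), rsp_payoff(-0.5)
center = [1.0 / 3.0] * 3

# Against the uniform opponent every action earns the same payoff (eps / 3),
# so no action is favored and the center is an equilibrium.
payoffs_x = matvec(A, center)
payoffs_y = matvec(B, center)
spread_x = max(payoffs_x) - min(payoffs_x)
spread_y = max(payoffs_y) - min(payoffs_y)

zero_sum_case = is_zero_sum(A, B)                              # eps_X = -eps_Y
other_case = is_zero_sum(rsp_payoff(0.5), rsp_payoff(0.025))   # eps_X != -eps_Y
```

Because all payoffs are equal at the center, the bracketed payoff difference in the learning equations vanishes there, and so does the memory term, making the center a fixed point for any $\alpha$ and $\beta$.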

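In the special case of perfect memory ($\alpha_X = \alpha_Y = 0$) and with equal learning sensitivity ($\beta_X = \beta_Y$), the linear version (Eqs. (8)) of our model (Eq. (6)) reduces to multipopulation replicator equations [10]:

$$ \frac{\dot{x}_i}{x_i} = \big[ (A\mathbf{y})_i - \mathbf{x} \cdot A\mathbf{y} \big] \quad \text{and} \quad \frac{\dot{y}_i}{y_i} = \big[ (B\mathbf{x})_i - \mathbf{y} \cdot B\mathbf{x} \big] . \tag{13} $$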
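
As a numerical cross-check of the conservative case, the sketch below integrates the multipopulation replicator equations for the zero-sum rock-scissors-paper game and monitors the energy $H$. It uses a plain fourth-order Runge-Kutta step, which is a substitute chosen here for brevity and not a symplectic scheme. The step size, run length, and function names are illustrative assumptions. In simplex coordinates, with $x^* = y^* = (1/3, 1/3, 1/3)$, the energy reduces to $H = -\frac{1}{3}\sum_i \log x_i - \frac{1}{3}\sum_i \log y_i$.

```python
import math

EPS_X, EPS_Y = 0.5, -0.5   # zero-sum case, eps_X = -eps_Y

def rsp(eps):
    """Rock-scissors-paper payoff matrix with tie reward eps."""
    return [[eps, 1.0, -1.0], [-1.0, eps, 1.0], [1.0, -1.0, eps]]

A, B = rsp(EPS_X), rsp(EPS_Y)

def replicator(x, payoff, y):
    """dx_i/dt = x_i [ (A y)_i - x . A y ] for one agent."""
    f = [sum(p * yj for p, yj in zip(row, y)) for row in payoff]
    mean = sum(xi * fi for xi, fi in zip(x, f))
    return [xi * (fi - mean) for xi, fi in zip(x, f)]

def field(state):
    """Vector field for the joint state (x_1, x_2, x_3, y_1, y_2, y_3)."""
    x, y = state[:3], state[3:]
    return replicator(x, A, y) + replicator(y, B, x)

def rk4_step(state, dt):
    """One classical Runge-Kutta step."""
    def shifted(k, h):
        return [s + h * ki for s, ki in zip(state, k)]
    k1 = field(state)
    k2 = field(shifted(k1, dt / 2))
    k3 = field(shifted(k2, dt / 2))
    k4 = field(shifted(k3, dt))
    return [s + dt / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def energy(state):
    """H in simplex coordinates for the uniform interior equilibrium."""
    return -sum(math.log(s) for s in state) / 3.0

state = [0.05, 0.35, 0.6, 0.1, 0.2, 0.7]
h_start = energy(state)
for _ in range(1000):
    state = rk4_step(state, 0.01)
drift = abs(energy(state) - h_start)
```

Over this short run the energy drift stays small, consistent with a conserved $H$. For long runs a symplectic integrator is the better choice, since a Runge-Kutta scheme lets the energy error accumulate.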

The game-theoretic behavior in this case with rock-scissors-paper interactions (Eqs. (12)) was investigated in Ref. [13]. Here, before contrasting our more general setting, we briefly recall the behavior in these special cases, noting several additional results.

FIG. 1: Quasiperiodic tori and chaos: $\epsilon_X = -\epsilon_Y = 0.5$, $\alpha_X = \alpha_Y = 0$, and $\beta_X = \beta_Y$. We give a Poincaré section (top) on the hyperplane defined by $\dot{u}_1 = 0$ and $\dot{v}_1 > 0$; that is, in the $(\mathbf{x}, \mathbf{y})$ space: $(3 + \epsilon_X) y_1 + (3 - \epsilon_X) y_2 - 2 = 0$ and $(3 + \epsilon_Y) x_1 + (3 - \epsilon_Y) x_2 - 2 < 0$. There are 23 randomly selected initial conditions with energies $H = -\frac{1}{3}(u_1 + u_2 + v_1 + v_2) + \log(1 + e^{u_1} + e^{u_2}) + \log(1 + e^{v_1} + e^{v_2}) = 2.941693$, which surface forms the outer border of $H \le 2.941693$. Two rows (bottom): Representative trajectories, simulated with a 4th-order symplectic integrator [12], starting from initial conditions within the Poincaré section. The upper simplices show a torus in the section's upper right corner; see the enlarged section at the upper right. The initial condition is $(\mathbf{x}, \mathbf{y}) = (0.3, 0.054196, 0.645804, 0.1, 0.2, 0.7)$. The lower simplices are an example of a chaotic trajectory passing through the regions in the section that are a scatter of dots; the initial condition is $(\mathbf{x}, \mathbf{y}) = (0.05, 0.35, 0.6, 0.1, 0.2, 0.7)$.
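
The energy value quoted in the caption can be reproduced from the two listed initial conditions. The sketch below evaluates $H$ exactly as written in the caption; the function and variable names are illustrative.

```python
import math

def energy(x, y):
    """H from the Fig. 1 caption, with u_i = log(x_{i+1}/x_1), v_i = log(y_{i+1}/y_1)."""
    u = [math.log(xi / x[0]) for xi in x[1:]]
    v = [math.log(yi / y[0]) for yi in y[1:]]
    return (-(sum(u) + sum(v)) / 3.0
            + math.log(1.0 + sum(math.exp(ui) for ui in u))
            + math.log(1.0 + sum(math.exp(vi) for vi in v)))

torus_start = ([0.3, 0.054196, 0.645804], [0.1, 0.2, 0.7])
chaos_start = ([0.05, 0.35, 0.6], [0.1, 0.2, 0.7])

h_torus = energy(*torus_start)
h_chaos = energy(*chaos_start)
print(round(h_torus, 6), round(h_chaos, 6))   # 2.941693 2.941693
```

Both initial conditions lie on the same energy surface, so the torus and the chaotic trajectory differ in their position on that surface and not in their energy.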



Figure 1 shows Poincaré sections of Eqs. (13)'s trajectories on the hyperplane ($\dot{u}_1 = 0$, $\dot{v}_1 > 0$) and representative trajectories in the individual agent simplices $\Delta_X$ and $\Delta_Y$. When $\epsilon_X = -\epsilon_Y = 0.0$, we expect the system to be integrable and only quasiperiodic tori should exist. Otherwise, when $\epsilon_X = -\epsilon_Y > 0.0$, Hamiltonian chaos can occur with positive-negative pairs of Lyapunov exponents [13]. The dynamics is very rich: there are infinitely many distinct behaviors near the unstable fixed point at the center (the classical Nash equilibrium) and a periodic orbit arbitrarily close to any chaotic one. Moreover, when the game is not zero-sum ($\epsilon_X \neq -\epsilon_Y$), transients to heteroclinic cycles are observed. On the one hand, there are intermittent behaviors in which the time spent near pure strategies (the simplicial vertices) increases subexponentially with $\epsilon_X + \epsilon_Y < 0$ and, on the other hand, with $\epsilon_X + \epsilon_Y > 0$, chaotic transients persist.

FIG. 2: Limit cycle (top: $\epsilon_Y = 0.025$) and chaotic attractors (bottom: $\epsilon_Y = -0.365$), with $\epsilon_X = 0.5$, $\alpha_X = \alpha_Y = 0.01$, and $\beta_X = \beta_Y$.

Our framework goes beyond these special cases and, generally, beyond the standard multipopulation replicator equations (Eqs. (13)) due to its accounting for the effects of individual and collective learning and since the reward structure and the learning rules need not lead to linear interactions. For example, if the memory decay rates ($\alpha_X$ and $\alpha_Y$) are positive, the system becomes dissipative and exhibits limit cycles and chaotic attractors; see Fig. 2. Figure 3 (top) shows a diverse range of bifurcations as a function of $\epsilon_Y$: dynamics on the hyperplane ($\dot{u}_1 = 0$, $\dot{v}_1 > 0$) projected onto $y_1$. When the game is nearly zero-sum, agents can reach the stable Nash equilibrium, but chaos can also occur when $\epsilon_X + \epsilon_Y > 0$. Figure 3 (bottom) shows that the largest Lyapunov exponent is positive across a significant fraction of parameter space, indicating that chaos is common. The dual aspects of chaos, irregularity and coherence, imply that agents may behave cooperatively or competitively (or switch between both) in the collective dynamics. Such global behaviors ultimately derive from self-interested, myopic learning.
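
One common way to estimate a largest Lyapunov exponent is to follow a reference trajectory and a nearby copy, rescaling their separation after every step. The sketch below applies that two-trajectory method to the log-ratio form of the dynamics with rock-scissors-paper payoffs. It is not necessarily the procedure used for Fig. 3: the Runge-Kutta integrator, the value $\beta_X = \beta_Y = 1$, the initial condition, the step size, and the run length are all assumptions made here, and a short run such as this one gives only a rough estimate.

```python
import math

def rsp(eps):
    """Rock-scissors-paper payoff matrix with tie reward eps."""
    return [[eps, 1.0, -1.0], [-1.0, eps, 1.0], [1.0, -1.0, eps]]

def half_field(own, other, payoff, beta, alpha):
    """Log-ratio dynamic for one agent (N = 3); own and other are length-2 lists."""
    z = 1.0 + sum(math.exp(w) for w in other)
    out = []
    for i in (1, 2):
        num = payoff[i][0] - payoff[0][0]
        num += sum((payoff[i][j] - payoff[0][j]) * math.exp(other[j - 1]) for j in (1, 2))
        out.append(beta * num / z - alpha * own[i - 1])
    return out

def make_field(eps_x, eps_y, beta, alpha):
    """Build the joint vector field for the state (u_1, u_2, v_1, v_2)."""
    a, b = rsp(eps_x), rsp(eps_y)
    def field(s):
        u, v = s[:2], s[2:]
        return half_field(u, v, a, beta, alpha) + half_field(v, u, b, beta, alpha)
    return field

def rk4_step(field, s, dt):
    """One classical Runge-Kutta step."""
    def shifted(k, h):
        return [si + h * ki for si, ki in zip(s, k)]
    k1 = field(s)
    k2 = field(shifted(k1, dt / 2))
    k3 = field(shifted(k2, dt / 2))
    k4 = field(shifted(k3, dt))
    return [si + dt / 6 * (p + 2 * q + 2 * r + w)
            for si, p, q, r, w in zip(s, k1, k2, k3, k4)]

def largest_lyapunov(field, s, dt=0.01, steps=5000, d0=1e-8):
    """Average exponential growth rate of a small separation, rescaled every step."""
    s = list(s)
    shadow = [s[0] + d0] + s[1:]
    total = 0.0
    for _ in range(steps):
        s = rk4_step(field, s, dt)
        shadow = rk4_step(field, shadow, dt)
        dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(shadow, s)))
        total += math.log(dist / d0)
        # Pull the shadow trajectory back to distance d0 along the same direction.
        shadow = [q + (p - q) * d0 / dist for p, q in zip(shadow, s)]
    return total / (steps * dt)

field = make_field(eps_x=0.5, eps_y=-0.365, beta=1.0, alpha=0.01)
estimate = largest_lyapunov(field, [0.5, -0.3, 0.2, 0.4])
```

A useful sanity check is the limit $\beta_X = \beta_Y = 0$, where the dynamics reduce to pure memory decay, $\dot{u}_i = -\alpha_X u_i$, and the estimate should equal $-\alpha$.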

Within this framework a number of extensions suggest themselves as ways to investigate the emergence of collective behaviors. The most obvious is the generalization to an arbitrary number of agents with an arbitrary number of strategies and the analysis of behaviors in the thermodynamic limit; see, e.g., Ref. [15] as an alternative approach. It is relatively straightforward to develop an extension to the linear-reward version (Eqs. (8)) of our model. For three agents $X$, $Y$, and $Z$, one obtains:

Xi 

(14)

with tensor $(A)_{ijk} = a_{ijk}$, and similarly for $Y$ and $Z$. Not surprisingly, this is also a conservative system when the $\alpha$'s vanish. However, extending the general collective learning equations (Eq. (6)) to multiple agents is challenging and so will be reported elsewhere.

FIG. 3: Bifurcation diagram (top) of dissipative (learning with memory loss) dynamics projected onto coordinate $y_1$ from the Poincaré section hyperplane ($\dot{u}_1 = 0$, $\dot{v}_1 > 0$) and the largest two Lyapunov exponents $\lambda_1$ and $\lambda_2$ (bottom) as a function of $\epsilon_Y \in [-1, 1]$. Here $\epsilon_X = 0.5$, $\alpha_X = \alpha_Y = 0.01$, and $\beta_X = \beta_Y$. Simulations show that $\lambda_3$ and $\lambda_4$ are always negative.

To be relevant to applications, one also needs to develop a statistical dynamics generalization [16] of the deterministic equations of motion to account for finite and fluctuating numbers of agents and also finite histories used in learning. Finally, another direction, especially useful if one attempts to quantify collective function in large multiagent systems, will be structural and information-theoretic analyses [17] of local and global learning behaviors and, importantly, their differences. Analyzing the stored information in each agent versus that in the collective, the causal architecture of information flow between an individual agent and the group, and how individual and global memories are processed to sustain collective function are projects now made possible using this framework.

We presented a dynamical-systems model of collective learning in multiagent systems, which starts with reinforcement learning agents and reduces to coupled replicator equations, demonstrated that individual-agent learning induces a global game dynamics, and investigated some of the resulting periodic, intermittent, and chaotic behaviors with simple (linear) rock-scissors-paper game interactions. Our model gives a macroscopic description of a network of learning agents that can be straightforwardly extended to model a large number of heterogeneous agents in fluctuating environments. Since deterministic chaos occurs even in this simple setting, one expects that in high-dimensional and heterogeneous populations typical of multiagent systems intrinsic unpredictability will become a dominant collective behavior. Sustaining useful collective function in multiagent systems becomes an even more compelling question in light of these results.

the Postdoctoral Researchers Program at RIKEN.